High-dimensional Proximity Joins
Abstract
Many emerging data mining applications require a proximity (similarity) join between points in a high-dimensional domain. We present a new algorithm that uses a new data structure, called the ε-kd tree, for fast spatial proximity joins on high-dimensional points. This data structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the proposed data structure scales to high-dimensional data. We analyze the cost of the join for the ε-kd tree and the R-tree family, and show that the ε-kd tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life datasets, shows that proximity join using the ε-kd tree is typically 2 to 40 times faster than the R-tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε-kd tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional proximity joins, but do not match the performance of the ε-kd tree.
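As a rough illustration of what a proximity join computes, the following Python sketch returns every pair of points whose Euclidean distance is at most ε. It is not the ε-kd tree algorithm from the abstract. The function names `brute_force_join` and `strip_join`, the choice of Euclidean distance, and the idea of cutting only the first dimension into ε-wide strips are all assumptions made for this example. The second function only hints at why limiting the join test to neighboring partitions reduces the number of comparisons.

```python
from itertools import combinations
from math import dist


def brute_force_join(points, eps):
    """Return all index pairs (i, j), i < j, with Euclidean distance <= eps."""
    return sorted(
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= eps
    )


def strip_join(points, eps):
    """Same result, but compare only points in the same or adjacent eps-wide strips.

    Illustrative assumption: strips are cut along dimension 0 only.
    Two points within eps of each other differ by at most eps in
    dimension 0, so their strip indices differ by at most 1.
    """
    strips = {}
    for i, p in enumerate(points):
        strips.setdefault(int(p[0] // eps), []).append(i)

    pairs = set()
    for s, members in strips.items():
        neighbours = strips.get(s + 1, [])  # the strip s - 1 is handled from its own side
        for pos, a in enumerate(members):
            for b in members[pos + 1:] + neighbours:
                if dist(points[a], points[b]) <= eps:
                    pairs.add((min(a, b), max(a, b)))
    return sorted(pairs)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (3.2, 3.1), (10.0, 0.0)]
    print(brute_force_join(sample, 1.0))
    print(strip_join(sample, 1.0))
```

Both functions return the same pairs. The strip version skips any comparison between points whose strips are not adjacent, which is the kind of saving a partitioning structure aims for.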
Similar resources
Parallel Algorithms for High-Dimensional Proximity Joins
We consider the problem of parallelizing high-dimensional proximity joins. We present a parallel multidimensional join algorithm based on the epsilon-kdB tree and compare it with the more common approach of space partitioning. An evaluation of the algorithms on an IBM SP2 shared-nothing multiprocessor is presented using both synthetic and real-life datasets. We also examine the effectiveness ...
A Fast Algorithm for high-dimensional Similarity Joins
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of find...
Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data
Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of low-dimensional vector data. In this paper, we revisit and investigate ...
Fast similarity join for multi-dimensional data
To appear in Information Systems Journal, Elsevier, 2005. The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memor...
Class proximity measures - Dissimilarity-based classification and display of high-dimensional data
For two-class problems, we introduce and construct mappings of high-dimensional instances into dissimilarity (distance)-based Class-Proximity Planes. The Class Proximity Projections are extensions of our earlier relative distance plane mapping, and thus provide a more general and unified approach to the simultaneous classification and visualization of many-feature datasets. The mappings display...